Predicting Flight Delays with sklearn

In this notebook, we will be using features we've prepared in PySpark to predict flight delays via regression and classification.



In [1]:

    
import sys, os, re
sys.path.append("lib")
import utils

import numpy as np
import sklearn
import iso8601
import datetime
print("Imports loaded...")









    



Imports loaded...

Load and Inspect our JSON Training Data



In [2]:

    
# Load and check the size of our training data. May take a minute.
print("Original JSON file size: {:,} Bytes".format(os.path.getsize("../data/simple_flight_delay_features.jsonl")))
training_data = utils.read_json_lines_file('../data/simple_flight_delay_features.jsonl')
print("Training items: {:,}".format(len(training_data))) # 5,714,008
print("Data loaded...")









    



Original JSON file size: 1,676,709,758 Bytes
Training items: 5,714,008
Data loaded...



In [3]:

    
# Inspect a record before we alter them
print("Size of training data in RAM: {:,} Bytes".format(sys.getsizeof(training_data))) # 50MB
print(training_data[0])









    



Size of training data in RAM: 50,897,424 Bytes
{'DayOfWeek': 4, 'ArrDelay': 13.0, 'Dest': 'DFW', 'FlightDate': '2014-12-31T16:00:00.000-08:00', 'Carrier': 'AA', 'Distance': 569.0, 'FlightNum': '1024', 'CRSArrTime': '2015-01-01T10:10:00.000-08:00', 'CRSDepTime': '2015-01-01T07:30:00.000-08:00', 'DayOfMonth': 1, 'DayOfYear': 1, 'DepDelay': 14.0, 'Origin': 'ABQ'}

Sample our Data



In [4]:

    
# We need to sample our data to fit into RAM
training_data = np.random.choice(training_data, 1000000) # 'Sample down to 1MM examples'
print("Sampled items: {:,} Bytes".format(len(training_data)))
print("Data sampled...")









    



Sampled items: 1,000,000 Bytes
Data sampled...

Vectorize the Results (y)



In [5]:

    
# Separate our results from the rest of the data, vectorize and size up
results = [record['ArrDelay'] for record in training_data]
results_vector = np.array(results)
print("Results vectorized size: {:,}".format(sys.getsizeof(results_vector))) # 45,712,160 bytes
print("Results vectorized...")









    



Results vectorized size: 8,000,096
Results vectorized...

Prepare Training Data



In [6]:

    
# Remove the two delay fields and the flight date from our training data
for item in training_data:
  item.pop('ArrDelay', None)
  item.pop('FlightDate', None)
print("ArrDelay and FlightDate removed from training data...")









    



ArrDelay and FlightDate removed from training data...



In [7]:

    
# Must convert datetime strings to unix times
for item in training_data:
  if isinstance(item['CRSArrTime'], str):
    dt = iso8601.parse_date(item['CRSArrTime'])
    unix_time = int(dt.timestamp())
    item['CRSArrTime'] = unix_time
  if isinstance(item['CRSDepTime'], str):
    dt = iso8601.parse_date(item['CRSDepTime'])
    unix_time = int(dt.timestamp())
    item['CRSDepTime'] = unix_time
print("CRSArr/DepTime converted to unix time...")









    



CRSArr/DepTime converted to unix time...

Vectorize Training Data with `DictVectorizer`



In [8]:

    
# Use DictVectorizer to convert feature dicts to vectors
from sklearn.feature_extraction import DictVectorizer

print("Sampled dimensions: [{:,}]".format(len(training_data)))
vectorizer = DictVectorizer()
training_vectors = vectorizer.fit_transform(training_data)
print("Size of DictVectorized vectors: {:,} Bytes".format(training_vectors.data.nbytes))
print("Training data vectorized...")









    



Sampled dimensions: [1,000,000]
Size of DictVectorized vectors: 87,539,400 Bytes
Training data vectorized...

Prepare Experiment by Splitting Data into Train/Test



In [9]:

    
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  training_vectors,
  results_vector,
  test_size=0.1,
  random_state=43
)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
print("Test train split performed...")









    



(900000, 7436) (100000, 7436)
(900000,) (100000,)
Test train split performed...

Train our Model(s) on our Training Data



In [10]:

    
# Train a regressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error, r2_score
print("Regressor library and metrics imported...")









    



Regressor library and metrics imported...



In [11]:

    
regressor = LinearRegression()
print("Regressor instantiated...")









    



Regressor instantiated...



In [12]:

    
from sklearn.ensemble import GradientBoostingRegressor

regressor = GradientBoostingRegressor
print("Swapped gradient boosting trees for linear regression!")

# Lets go back for now...
regressor = LinearRegression()
print("Swapped back to linear regression!")









    



Swapped gradient boosting trees for linear regression!
Swapped back to linear regression!



In [13]:

    
regressor.fit(X_train, y_train)
print("Regressor fitted...")









    



Regressor fitted...

Predict Using the Test Data



In [14]:

    
predicted = regressor.predict(X_test)
print("Predictions made for X_test...")









    



Predictions made for X_test...

Evaluate and Visualize Model Accuracy



In [15]:

    
from sklearn.metrics import median_absolute_error, r2_score

# Median absolute error is the median of all absolute differences between the target and the prediction.
# Less is better, more indicates a high error between target and prediction.
medae = median_absolute_error(y_test, predicted)
print("Median absolute error:    {:.3g}".format(medae))

# R2 score is the coefficient of determination. Ranges from 1-0, 1.0 is best, 0.0 is worst.
# Measures how well future samples are likely to be predicted.
r2 = r2_score(y_test, predicted)
print("r2 score:                 {:.3g}".format(r2))









    



Median absolute error:    9.68
r2 score:                 0.828



In [111]:

    
# Plot outputs
import matplotlib.pyplot as plt

# Cleans up the appearance
plt.rcdefaults()

plt.scatter(
  y_test,
  predicted,
  color='blue',
  linewidth=1
)
plt.grid(True)

plt.xticks()
plt.yticks()

plt.show()

Predicting Flight Delays with sklearn

Load and Inspect our JSON Training Data

Sample our Data

Vectorize the Results (y)

Prepare Training Data

Vectorize Training Data with DictVectorizer

Prepare Experiment by Splitting Data into Train/Test

Train our Model(s) on our Training Data

Predict Using the Test Data

Evaluate and Visualize Model Accuracy

Vectorize Training Data with `DictVectorizer`